EDA - Exploratory Data Analysis!

Analysis of data saved in the file student_lifestyle_dataset.csv downloaded from the Kaggle website.

This dataset, titled "Daily Lifestyle and Academic Performance of Students", contains data from 2,000 students collected via a Google Form survey. It includes information on study hours, extracurricular activities, sleep, socializing, physical activity, stress levels, and CGPA. The data covers an academic year from August 2023 to May 2024 and reflects student lifestyles primarily from India. This dataset can help analyze the impact of daily habits on academic performance and student well-being.

  • File Format: CSV
  • File Name: Daily_Lifestyle_and_Academic_Performance.csv
  • Number of Records: 2000 rows
  • Number of Columns: 8 columns
  • Column Names: Student ID, Study Hours, Extracurricular Hours, Sleep Hours, Social Hours, Physical Activity Hours, Stress Level, GPA
  • File Size: Approximately 150 KB

Student project.

In [7]:
# import of necessary libraries
import pandas as pd
import seaborn as sns
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
In [8]:
# defining student_df variable and reading the csv file into the DataFrame
student_df = pd.read_csv('student_lifestyle_dataset.csv', sep=",")
In [9]:
student_df
Out[9]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
0 1 6.9 3.8 8.7 2.8 1.8 2.99 Moderate
1 2 5.3 3.5 8.0 4.2 3.0 2.75 Low
2 3 5.1 3.9 9.2 1.2 4.6 2.67 Low
3 4 6.5 2.1 7.2 1.7 6.5 2.88 Moderate
4 5 8.1 0.6 6.5 2.2 6.6 3.51 High
... ... ... ... ... ... ... ... ...
1995 1996 6.5 0.2 7.4 2.1 7.8 3.32 Moderate
1996 1997 6.3 2.8 8.8 1.5 4.6 2.65 Moderate
1997 1998 6.2 0.0 6.2 0.8 10.8 3.14 Moderate
1998 1999 8.1 0.7 7.6 3.5 4.1 3.04 High
1999 2000 9.0 1.7 7.3 3.1 2.9 3.58 High

2000 rows × 8 columns

General overview of the data.

In [75]:
student_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Student_ID                       2000 non-null   int64  
 1   Study_Hours_Per_Day              2000 non-null   float64
 2   Extracurricular_Hours_Per_Day    2000 non-null   float64
 3   Sleep_Hours_Per_Day              2000 non-null   float64
 4   Social_Hours_Per_Day             2000 non-null   float64
 5   Physical_Activity_Hours_Per_Day  2000 non-null   float64
 6   GPA                              2000 non-null   float64
 7   Stress_Level                     2000 non-null   object 
dtypes: float64(6), int64(1), object(1)
memory usage: 125.1+ KB
In [76]:
student_df.head() # displaying initial values
Out[76]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
0 1 6.9 3.8 8.7 2.8 1.8 2.99 Moderate
1 2 5.3 3.5 8.0 4.2 3.0 2.75 Low
2 3 5.1 3.9 9.2 1.2 4.6 2.67 Low
3 4 6.5 2.1 7.2 1.7 6.5 2.88 Moderate
4 5 8.1 0.6 6.5 2.2 6.6 3.51 High
In [77]:
student_df.tail() # displaying final values
Out[77]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
1995 1996 6.5 0.2 7.4 2.1 7.8 3.32 Moderate
1996 1997 6.3 2.8 8.8 1.5 4.6 2.65 Moderate
1997 1998 6.2 0.0 6.2 0.8 10.8 3.14 Moderate
1998 1999 8.1 0.7 7.6 3.5 4.1 3.04 High
1999 2000 9.0 1.7 7.3 3.1 2.9 3.58 High
In [78]:
student_df.sample(15) # displaying 15 random records
Out[78]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
171 172 5.1 0.3 6.0 0.2 12.4 2.70 Low
326 327 5.3 4.0 5.3 4.2 5.2 2.67 High
60 61 8.3 1.1 9.8 3.5 1.3 3.67 High
1663 1664 5.4 2.4 8.0 2.5 5.7 2.77 Low
595 596 6.0 3.5 8.7 0.1 5.7 2.95 Moderate
1349 1350 7.9 2.5 7.8 4.3 1.5 3.19 Moderate
1204 1205 8.4 1.1 6.4 5.6 2.5 3.37 High
1148 1149 9.9 2.0 7.3 3.6 1.2 3.51 High
34 35 9.7 0.6 6.7 0.7 6.3 3.62 High
318 319 9.0 1.2 5.4 2.4 6.0 3.46 High
840 841 5.8 0.5 5.6 3.1 9.0 2.73 High
134 135 6.3 2.7 8.8 3.6 2.6 2.45 Moderate
131 132 5.6 2.4 6.4 3.5 6.1 2.85 Low
1628 1629 7.0 2.7 5.3 3.6 5.4 3.17 High
376 377 9.4 3.4 5.9 4.3 1.0 3.18 High
In [79]:
student_df.describe() # displaying statistics for numeric columns
Out[79]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA
count 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.00000 2000.000000
mean 1000.500000 7.475800 1.990100 7.501250 2.704550 4.32830 3.115960
std 577.494589 1.423888 1.155855 1.460949 1.688514 2.51411 0.298674
min 1.000000 5.000000 0.000000 5.000000 0.000000 0.00000 2.240000
25% 500.750000 6.300000 1.000000 6.200000 1.200000 2.40000 2.900000
50% 1000.500000 7.400000 2.000000 7.500000 2.600000 4.10000 3.110000
75% 1500.250000 8.700000 3.000000 8.800000 4.100000 6.10000 3.330000
max 2000.000000 10.000000 4.000000 10.000000 6.000000 13.00000 4.000000

Preliminary observations.

According to the analyzed data set, students study about 7.5 hours a day on average (mean 7.48), with a minimum of 5 and a maximum of 10 hours.

The time allocated to extracurricular activities ranges from 0 to 4 hours per day and averages about 2 hours.

Students sleep about 7.5 hours on average, with the shortest sleep time being 5 hours and the longest 10 hours. Social time averages about 2.7 hours a day.

They spend an average of about 4.3 hours a day on physical activity, with a maximum of 13 hours.

The highest GPA is 4.0 and the lowest is 2.24; the mean is 3.12.
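
The figures above can be pulled from the data instead of being read off the table by eye. A minimal sketch, assuming a small stand-in frame (`sample_df`, a hypothetical name, built from the first five rows displayed earlier) in place of the full `student_df`, so the printed values differ from the full-file statistics:

```python
import pandas as pd

# Stand-in for student_df: the first five rows displayed above (assumption:
# the full file would be used in the notebook itself).
sample_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51],
})

# Keep only the statistics quoted in the text and round them for readability.
summary = sample_df.agg(["mean", "min", "max"]).round(2)
print(summary)
```

On the full data the same call should be applied to the numeric columns only, for example `student_df.select_dtypes("number")`, because `Stress_Level` is a text column.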

Missing value and duplicate analysis.

In [80]:
student_df.isnull().sum()
Out[80]:
Student_ID                         0
Study_Hours_Per_Day                0
Extracurricular_Hours_Per_Day      0
Sleep_Hours_Per_Day                0
Social_Hours_Per_Day               0
Physical_Activity_Hours_Per_Day    0
GPA                                0
Stress_Level                       0
dtype: int64
In [81]:
student_df[student_df.duplicated()] # displaying duplicates
Out[81]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level

The analyzed set has no missing values and no duplicates.
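
The two checks can also be reduced to plain numbers, which is easier to assert on than a table. A minimal sketch, again assuming a stand-in frame `sample_df` (hypothetical name) built from the first rows shown earlier. The extra check on `Student_ID` is an addition not present in the cells above:

```python
import pandas as pd

# Stand-in for student_df: a few columns of the first five rows shown above.
sample_df = pd.DataFrame({
    "Student_ID": [1, 2, 3, 4, 5],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51],
    "Stress_Level": ["Moderate", "Low", "Low", "Moderate", "High"],
})

# Total number of missing cells across all columns.
missing_total = int(sample_df.isna().sum().sum())

# Fully identical rows, and repeated identifiers (a stricter check).
duplicate_rows = int(sample_df.duplicated().sum())
duplicate_ids = int(sample_df["Student_ID"].duplicated().sum())

print(f"missing={missing_total}, duplicate rows={duplicate_rows}, duplicate IDs={duplicate_ids}")
```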

Single variable analysis.

In [82]:
fig = px.histogram(
    student_df,
    x="Study_Hours_Per_Day",
    title="Distribution of Study_Hours_Per_Day",
    width=900,
    height=500,
)
fig.show()        # histograms for individual columns
In [83]:
sns.displot(data=student_df, x="Study_Hours_Per_Day", col = "Stress_Level", kde=True)  #distribution of numerical variables divided into stress levels
Out[83]:
<seaborn.axisgrid.FacetGrid at 0x226c7697b90>
In [84]:
fig = px.histogram(
    student_df,
    x="Extracurricular_Hours_Per_Day",
    title="Distribution of Extracurricular_Hours_Per_Day",
    width=700,
    height=400,
)
fig.show()
In [85]:
sns.displot(data=student_df, x="Extracurricular_Hours_Per_Day", col = "Stress_Level", kde=True) 
Out[85]:
<seaborn.axisgrid.FacetGrid at 0x226c97c1910>
In [86]:
fig = px.histogram(
    student_df,
    x="Sleep_Hours_Per_Day",
    title="Distribution of Sleep_Hours_Per_Day",
    width=800,
    height=500,
)
fig.show()
In [87]:
sns.displot(data=student_df, x="Sleep_Hours_Per_Day", col = "Stress_Level", kde=True) 
Out[87]:
<seaborn.axisgrid.FacetGrid at 0x226c964c310>
In [88]:
fig = px.histogram(
    student_df,
    x="Social_Hours_Per_Day",
    title="Distribution of Social_Hours_Per_Day",
    width=800,
    height=500,
)
fig.show()
In [89]:
sns.displot(data=student_df, x="Social_Hours_Per_Day", col = "Stress_Level", kde=True) 
Out[89]:
<seaborn.axisgrid.FacetGrid at 0x226c96006d0>
In [90]:
fig = px.histogram(
    student_df,
    x="Physical_Activity_Hours_Per_Day",
    title="Distribution of Physical_Activity_Hours_Per_Day",
    width=800,
    height=500,
)
fig.show()
In [91]:
sns.displot(data=student_df, x="Physical_Activity_Hours_Per_Day", col = "Stress_Level", kde=True) 
Out[91]:
<seaborn.axisgrid.FacetGrid at 0x226c7aed750>
In [92]:
fig = px.histogram(
    student_df,
    x="GPA",
    title="Distribution of GPA",
    width=800,
    height=500,
)
fig.show()
In [93]:
sns.displot(data=student_df, x="GPA", col = "Stress_Level", kde=True) 
Out[93]:
<seaborn.axisgrid.FacetGrid at 0x226c8d87b90>
In [94]:
fig = px.histogram(
    student_df,
    x="Stress_Level",
    title="Distribution of Stress_Level",
    width=800,
    height=500,
)
fig.show()
In [95]:
student_df['Stress_Level'].value_counts()
Out[95]:
High        1029
Moderate     674
Low          297
Name: Stress_Level, dtype: int64
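
The counts are easier to compare as percentages, and the three levels have a natural order that plain strings do not carry. A minimal sketch under two assumptions: the counts are copied by hand from the output above, and the names `stress_dtype` and `levels` are illustrative only:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"High": 1029, "Moderate": 674, "Low": 297}, name="Stress_Level")

# Share of each stress level in percent.
shares = (counts / counts.sum() * 100).round(2)
print(shares)

# An ordered categorical dtype makes Low < Moderate < High explicit,
# so sorting follows the stress scale instead of the alphabet.
stress_dtype = pd.CategoricalDtype(["Low", "Moderate", "High"], ordered=True)
levels = pd.Series(["Moderate", "Low", "Low", "Moderate", "High"]).astype(stress_dtype)
sorted_levels = levels.sort_values().tolist()
print(sorted_levels)
```

Applying the same dtype to `student_df["Stress_Level"]` is one possible way to get the facets and box plots in Low, Moderate, High order.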
In [96]:
student_df[student_df['GPA']==4.0] # displaying the student with the highest GPA
Out[96]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
51 52 9.0 2.6 8.5 3.1 0.8 4.0 High
In [97]:
student_df[student_df['GPA']==2.24] # displaying the student with the lowest GPA
Out[97]:
Student_ID Study_Hours_Per_Day Extracurricular_Hours_Per_Day Sleep_Hours_Per_Day Social_Hours_Per_Day Physical_Activity_Hours_Per_Day GPA Stress_Level
764 765 5.5 1.8 6.7 5.2 4.8 2.24 Low
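
The two lookups above hard-code 4.0 and 2.24, which were read from `describe()`. A minimal sketch of the same lookups without typed-in constants, assuming a stand-in frame `sample_df` (hypothetical name) built from the first five rows shown earlier:

```python
import pandas as pd

# Stand-in for student_df: a few columns of the first five rows shown above.
sample_df = pd.DataFrame({
    "Student_ID": [1, 2, 3, 4, 5],
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51],
})

# Compare against the computed extremes; ties are all returned.
top = sample_df[sample_df["GPA"] == sample_df["GPA"].max()]
bottom = sample_df[sample_df["GPA"] == sample_df["GPA"].min()]

print(top)
print(bottom)
```

Comparing against `max()` and `min()` keeps every tied student, whereas `idxmax()` would return only the first one.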

Short observations.

Just over half of the surveyed students (1,029 of 2,000) report a high level of stress. Only 297 students report a low level.

Low-stress students spend about 5-6 hours a day studying. In the moderate-stress group, study time is above 5 hours but below 9.

In the high-stress group the range is wide, from 5 to 10 hours a day, although most of these students study for about 9 hours. Sleep of less than 6 hours is observed only in the high-stress group.

Within the high-stress group, most students sleep less than 6 hours.
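
The ranges described above were read from the faceted histograms. A minimal sketch of how they could be tabulated instead, assuming a stand-in frame `sample_df` (hypothetical name) made of the first and last five rows displayed earlier, so the numbers it prints are not the full-dataset ranges:

```python
import pandas as pd

# Stand-in for student_df: rows 0-4 and 1995-1999 as displayed above.
sample_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1, 6.5, 6.3, 6.2, 8.1, 9.0],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5, 7.4, 8.8, 6.2, 7.6, 7.3],
    "Stress_Level": ["Moderate", "Low", "Low", "Moderate", "High",
                     "Moderate", "Moderate", "Moderate", "High", "High"],
})

# Minimum, median and maximum of study and sleep time per stress level.
ranges = (
    sample_df
    .groupby("Stress_Level")[["Study_Hours_Per_Day", "Sleep_Hours_Per_Day"]]
    .agg(["min", "median", "max"])
)
print(ranges)
```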

Analysis of relationships between variables.

In [98]:
plt.scatter('Study_Hours_Per_Day', 'GPA',  data=student_df)
plt.xlabel('Study_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between study hours and grade point average.')
plt.show()

The longer the time spent studying, the higher the average grade tends to be.

In [144]:
sns.relplot(
    data=student_df,
    x="GPA",
    y="Study_Hours_Per_Day",
    col="Stress_Level",
    hue="Stress_Level",
)
Out[144]:
<seaborn.axisgrid.FacetGrid at 0x1776f2bbc90>
In [99]:
sns.lmplot(data=student_df, x="GPA", y="Study_Hours_Per_Day", col="Stress_Level", hue="Stress_Level")
Out[99]:
<seaborn.axisgrid.FacetGrid at 0x226c7697390>
In [100]:
sns.jointplot(data=student_df, x="Study_Hours_Per_Day", y="GPA", hue="Stress_Level")
Out[100]:
<seaborn.axisgrid.JointGrid at 0x226c8c03b90>
In [101]:
fig = px.scatter(student_df, x="GPA", y="Extracurricular_Hours_Per_Day")
fig.update_layout(
    title="Relationship between extracurricular activities and grade point average",
    xaxis_title="GPA",
    yaxis_title="Extracurricular_Hours_Per_Day",
)
fig.show()
In [102]:
plt.scatter('Sleep_Hours_Per_Day', 'GPA',  data=student_df)
plt.xlabel('Sleep_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between sleep hours and grade point average.')
plt.show()
In [103]:
sns.relplot(
    data=student_df,
    x="GPA",
    y="Sleep_Hours_Per_Day",
    col="Stress_Level",
    hue="Stress_Level",
    
)
Out[103]:
<seaborn.axisgrid.FacetGrid at 0x226c71dbc90>

In the group of students with high levels of stress, we notice that low average grades go hand in hand with short sleep time.

In [104]:
sns.relplot(
    data=student_df,
    x="Sleep_Hours_Per_Day",
    y="Study_Hours_Per_Day",
    col="Stress_Level",
    hue="GPA",
    size="GPA",
)
Out[104]:
<seaborn.axisgrid.FacetGrid at 0x226c6d65150>

In the group of students with high levels of stress, average grades increase with more time spent studying and sleeping.

For low-stress students, GPA shows no apparent change with more sleep.

In [105]:
plt.scatter('Social_Hours_Per_Day', 'GPA',  data=student_df)
plt.xlabel('Social_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between social activity hours and grade point average.')
plt.show()
In [106]:
sns.relplot(
    data=student_df,
    x="Study_Hours_Per_Day",
    y="Social_Hours_Per_Day",
    col="Stress_Level",
    hue="GPA",
    #size="Stress_Level",
)
Out[106]:
<seaborn.axisgrid.FacetGrid at 0x226c67d5010>
In [107]:
plt.scatter('Physical_Activity_Hours_Per_Day', 'GPA',  data=student_df)
plt.xlabel('Physical_Activity_Hours_Per_Day')
plt.ylabel('GPA')
plt.title('Relationship between physical activity hours and grade point average.')
plt.show()
In [108]:
sns.jointplot(data=student_df,  x="Social_Hours_Per_Day", y="Physical_Activity_Hours_Per_Day", hue="Stress_Level")
Out[108]:
<seaborn.axisgrid.JointGrid at 0x226c5e63b90>
In [109]:
sns.relplot(
    data=student_df,
    x="Social_Hours_Per_Day",
    y="Physical_Activity_Hours_Per_Day",
    col="Stress_Level",
    hue="GPA",
    size="GPA",
)
Out[109]:
<seaborn.axisgrid.FacetGrid at 0x226a8629450>

Mainly among students with low and moderate stress, social time decreases as the hours devoted to physical activity increase. In the high-stress group, more time devoted to physical and social activities goes together with a lower GPA.
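
One possible reason for this trade-off is arithmetic. On the rows displayed in this notebook, the five hour columns add up to 24 hours per student. If that holds for the whole file (an assumption not checked here), more time in one activity necessarily leaves less for the others. A minimal sketch of the check, using the first five rows shown earlier as a stand-in frame `sample_df` (hypothetical name):

```python
import pandas as pd

hour_cols = [
    "Study_Hours_Per_Day",
    "Extracurricular_Hours_Per_Day",
    "Sleep_Hours_Per_Day",
    "Social_Hours_Per_Day",
    "Physical_Activity_Hours_Per_Day",
]

# Stand-in for student_df: the first five rows displayed above.
sample_df = pd.DataFrame(
    [
        [6.9, 3.8, 8.7, 2.8, 1.8],
        [5.3, 3.5, 8.0, 4.2, 3.0],
        [5.1, 3.9, 9.2, 1.2, 4.6],
        [6.5, 2.1, 7.2, 1.7, 6.5],
        [8.1, 0.6, 6.5, 2.2, 6.6],
    ],
    columns=hour_cols,
)

# Sum the activities per student; round to hide floating-point noise.
daily_total = sample_df[hour_cols].sum(axis=1).round(1)
print(daily_total.tolist())
```

Running the same sum on `student_df` would show whether the 24-hour budget applies to every row, which matters when interpreting negative correlations between the hour columns.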

In [110]:
sns.relplot(
    data=student_df,
    kind="line",
    x="Sleep_Hours_Per_Day",
    y="Physical_Activity_Hours_Per_Day",
    style="Stress_Level",
    hue="Stress_Level",
    
)
Out[110]:
<seaborn.axisgrid.FacetGrid at 0x226c3939990>
In [111]:
correlation_df = student_df[['Study_Hours_Per_Day','Extracurricular_Hours_Per_Day','Sleep_Hours_Per_Day', 'Social_Hours_Per_Day', 'Physical_Activity_Hours_Per_Day', 'GPA',]].corr()
In [112]:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_df, annot=True, cmap='coolwarm', linewidths=.5)

plt.title('Heatmap showing correlations between variables')
plt.show()

There is a noticeable positive correlation between the time spent studying and the average grade.
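
The heatmap can be complemented by a ranked list of correlations with GPA, which makes the strongest relationship explicit. A minimal sketch, assuming a stand-in frame `sample_df` (hypothetical name) made of the first and last five rows displayed earlier. With only ten rows the coefficients are illustrative and are not the values shown in the heatmap:

```python
import pandas as pd

# Stand-in for student_df: rows 0-4 and 1995-1999 as displayed above.
sample_df = pd.DataFrame({
    "Study_Hours_Per_Day": [6.9, 5.3, 5.1, 6.5, 8.1, 6.5, 6.3, 6.2, 8.1, 9.0],
    "Sleep_Hours_Per_Day": [8.7, 8.0, 9.2, 7.2, 6.5, 7.4, 8.8, 6.2, 7.6, 7.3],
    "GPA": [2.99, 2.75, 2.67, 2.88, 3.51, 3.32, 2.65, 3.14, 3.04, 3.58],
})

# Pearson correlation of every column with GPA, strongest first by magnitude.
corr_with_gpa = (
    sample_df.corr()["GPA"]
    .drop("GPA")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_gpa.round(2))
```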

In [113]:
sns.pairplot(data=student_df, hue="Stress_Level")
Out[113]:
<seaborn.axisgrid.PairGrid at 0x226c38a1c10>

Outlier analysis.

In [114]:
student_df.groupby('Stress_Level').plot(kind='box', figsize=(20,8), grid=True)
Out[114]:
Stress_Level
High        Axes(0.125,0.11;0.775x0.77)
Low         Axes(0.125,0.11;0.775x0.77)
Moderate    Axes(0.125,0.11;0.775x0.77)
dtype: object

Outliers appear in GPA, Study Hours and Physical Activity Hours.
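
Box plots draw points as outliers when they fall beyond the whiskers, which by default in matplotlib and seaborn extend 1.5 times the interquartile range past the quartiles. A minimal sketch of that rule as a function, so the outliers can be listed and counted. The helper `iqr_outliers` and the values in `hours` are hypothetical and chosen for illustration, not taken from the dataset:

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Illustrative values only (assumption), not rows from the CSV file.
hours = pd.Series(
    [2.0, 3.0, 4.0, 4.5, 5.0, 6.0, 13.0],
    name="Physical_Activity_Hours_Per_Day",
)

outliers = iqr_outliers(hours)
print(outliers)
```

Applied per stress level, for example through `groupby("Stress_Level")`, the same function would mirror the grouped box plots in the following cells.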

In [115]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=student_df, x='Stress_Level', y='GPA', hue='Stress_Level')


plt.title('Box plot', fontsize=16)
plt.xlabel('Stress_Level', fontsize=12)
plt.ylabel('GPA', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()


plt.show()
In [116]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=student_df, x='Stress_Level',  y='Physical_Activity_Hours_Per_Day', hue='Stress_Level')


plt.title('Box plot', fontsize=16)
plt.xlabel('Stress_Level', fontsize=12)
plt.ylabel('Physical_Activity_Hours_Per_Day', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()


plt.show()
In [117]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=student_df, x='Stress_Level',  y='Study_Hours_Per_Day', hue="Stress_Level" )


plt.title('Box plot', fontsize=16)
plt.xlabel('Stress_Level', fontsize=12)
plt.ylabel('Study_Hours_Per_Day', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()


plt.show()

Analysis summary.

We observe a positive relationship between study hours and grade point average. The longer students study, the higher their grades tend to be, and the higher their stress level.

Students in the low-stress group, who also study the least, achieve GPAs of roughly 2.25 to 3.50.

We do not observe a significant relationship between extracurricular activities and grade point average.

Among students with low levels of stress, GPAs are lower regardless of the amount of sleep, reaching at most 3.5. In the group of students with moderate stress, GPA is higher on average regardless of sleep time, but these students also study more than those with low stress.

In turn, in the group of students with high levels of stress, GPAs above 3.9 appear, but only among those who study for longer than 8 hours. We do not observe a relationship between sleep and GPA in this group: among students who study more than 8 hours, high average grades occur both with approximately 5 hours of sleep and with 9 hours of sleep.

It can be noticed that as the time spent on physical activity increases, the number of hours spent on social life decreases.

The question arises whether grades at university are really more important than our health and quality of life. Can it be said that people with higher academic results but less physical and social activity are satisfied? Do people with lower stress levels but greater social and physical activity lose something by achieving a lower grade point average?